
Journal of Computational Biology

SAGE Publications

Preprints posted in the last 90 days, ranked by how well they match Journal of Computational Biology's content profile, based on 37 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
Identifying Robust Subclonal Structures through Tumor Progression Tree Alignment

Gilbert, J.; Wu, C. H.; Knittel, H.; Schäffer, A. A.; Malikic, S.; Sahinalp, C.

2026-02-27 cancer biology 10.64898/2026.02.25.708046 medRxiv
Top 0.1%
12.4%

Understanding and comparing tumor evolutionary histories is fundamental to cancer genomics. Clonal trees, used to model tumor progression, are rooted, unordered trees in which each node represents a subclone labeled by a set of distinct mutations. To compare two clonal trees, we introduce omlta, the optimal multi-label tree alignment, which removes the minimum number of mutation labels from the trees, so that the remaining trees are isomorphic. Computing omlta is NP-hard. Here, we present an algorithm to compute the omlta, with a running time of [Formula] where L ≥ 1 is the total number of mutation labels occurring in the input trees and k is the minimum possible number of mutation labels that need to be removed for the alignment. Our implementation (https://github.com/algo-cancer/omlta) is the first computational tool for determining the optimal alignment between clonal trees. We applied omlta to 126 cases from the TRACERx study on non-small cell lung cancers and some melanoma single-cell data.

2
On the consistency of duplication, loss, and deep coalescence gene tree parsimony costs under the multispecies coalescent

Sapoval, N.; Nakhleh, L.

2026-02-20 bioinformatics 10.64898/2026.02.20.707019 medRxiv
Top 0.1%
6.7%

Gene tree parsimony (GTP) is a common approach for efficient reconciliation of multiple discordant gene tree phylogenies for the inference of a single species tree. However, despite the popularity of GTP methods due to their low computational costs, prior work has shown that some commonly employed parsimony costs are statistically inconsistent under the multispecies coalescent process. Furthermore, a fine-grained analysis of the inconsistency has indicated potentially complementary behavior of duplication and deep coalescence costs for symmetric and asymmetric species trees. In this work, we prove inconsistency of GTP estimators for all linear combinations of duplication, loss and deep coalescence scores. We also explore empirical implications of this result, evaluating inference results of several GTP cost schemes under varying levels of incomplete lineage sorting.

3
k-Nearest Common Leaves algorithm for phylogenetic tree completion

Koshkarov, A.; Tahiri, N.

2026-04-04 evolutionary biology 10.64898/2026.04.02.716144 medRxiv
Top 0.1%
6.5%

Phylogenetic trees represent the evolutionary histories of taxa and support tasks such as clustering and Tree of Life reconstruction. Many established comparison methods, including the Robinson-Foulds (RF) distance, assume identical taxon sets. A methodological gap remains for trees with distinct but overlapping taxa. Existing approaches either prune non-common leaves, which can discard information, or complete both trees such that they share the same taxa. Completion is more comprehensive, but current methods typically ignore branch lengths, which are essential for identifying evolutionary patterns. This paper introduces k-Nearest Common Leaves (k-NCL), an algorithm for completing rooted phylogenetic trees defined on different but overlapping taxa. The method uses branch lengths and topological characteristics and does not rely on a specific distance measure. The k-NCL algorithm is designed to preserve evolutionary relationships in the trees under comparison. The running time is O(n^2), where n is the size of the union of the two leaf sets. Additional properties include preservation of original distances and topology, symmetry, and uniqueness of the completion. Implemented in Python, k-NCL is evaluated on biological datasets of amphibians, birds, mammals, and sharks. Experimental results show that RF combined with k-NCL improves phylogenetic tree clustering performance compared to the RF(+) tree completion approach. Availability and implementation: An open-source implementation of k-NCL in Python and the datasets used in this study are available at https://github.com/tahiri-lab/KNCL.

4
PaNDA: Efficient Optimization of Phylogenetic Diversity in Networks

Holtgrefe, N.; van Iersel, L.; Meuwese, R.; Murakami, Y.; Schestag, J.

2026-02-25 bioinformatics 10.1101/2025.11.14.688467 medRxiv
Top 0.1%
4.2%

Phylogenetic diversity plays an important role in biodiversity, conservation, and evolutionary studies by measuring the diversity of a set of taxa based on their phylogenetic relationships. In phylogenetic trees, a subset of k taxa with maximum phylogenetic diversity can be found by a simple and efficient greedy algorithm. However, this algorithmic tractability is lost when considering phylogenetic networks, which incorporate reticulate evolutionary events such as hybridization and horizontal gene transfer. To address this challenge, we introduce PaNDA (Phylogenetic Network Diversity Algorithms), the first software package and interactive graphical user-interface for exploring, visualizing and maximizing diversity in phylogenetic networks. PaNDA includes a novel algorithm to find a subset of k taxa with maximum diversity, running in polynomial time for networks of bounded scanwidth, a measure of tree-likeness of a network that grows slower than the well-known level measure. This algorithm considers the variant of phylogenetic diversity on networks in which the branch lengths of all paths from the root to the selected taxa contribute towards their diversity. We demonstrate the scalability of this algorithm on simulated networks, successfully analyzing level-15 networks with up to 200 taxa in seconds. We also provide a proof-of-concept analysis using a phylogenetic network on Xiphophorus species, illustrating how the tool can support diversity studies based on real genomic data. The software is easily installable and freely available at https://github.com/nholtgrefe/panda. Additionally, we extend the definition of phylogenetic diversity to semi-directed phylogenetic networks, which are mixed graphs increasingly used in phylogenetic analysis to model uncertainty of the root location. We prove that finding a subset of k taxa with maximum diversity remains NP-hard on semi-directed networks, but do present a polynomial-time algorithm for networks with bounded level.
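
The greedy algorithm for trees mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not PaNDA's code: it assumes a rooted tree given as hypothetical `parent` and `length` dictionaries, and it measures diversity as the total length of the branches on the paths from the root to the chosen taxa.

```python
def greedy_max_pd(parent, length, leaves, k):
    """Greedily pick k leaves of a rooted tree to maximize phylogenetic diversity.

    parent: dict mapping each non-root node to its parent (hypothetical input format)
    length: dict mapping each non-root node to the length of the branch above it
    leaves: list of candidate taxa
    """
    covered = set()   # nodes whose incoming branch is already counted
    chosen = []
    total = 0.0
    for _ in range(min(k, len(leaves))):
        best_leaf, best_gain = None, -1.0
        for leaf in leaves:
            if leaf in chosen:
                continue
            # Marginal gain: uncovered branch length on the path towards the root.
            gain, node = 0.0, leaf
            while node in parent and node not in covered:
                gain += length[node]
                node = parent[node]
            if gain > best_gain:
                best_leaf, best_gain = leaf, gain
        # Mark the newly covered branches.
        node = best_leaf
        while node in parent and node not in covered:
            covered.add(node)
            node = parent[node]
        chosen.append(best_leaf)
        total += best_gain
    return chosen, total


# Toy tree: r -> x (1), r -> c (5), x -> a (2), x -> b (3)
parent = {"x": "r", "c": "r", "a": "x", "b": "x"}
length = {"x": 1.0, "c": 5.0, "a": 2.0, "b": 3.0}
chosen, total = greedy_max_pd(parent, length, ["a", "b", "c"], k=2)
```

On the toy tree the sketch picks `c` and then `b`, which is also the best pair by exhaustive comparison. As the abstract notes, this tractability does not carry over to networks.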

5
Analysis of biological networks using Krylov subspace trajectories

Frost, H. R.

2026-03-31 bioinformatics 10.64898/2026.03.29.715092 medRxiv
Top 0.1%
3.7%

We describe an approach for analyzing biological networks using rows of the Krylov subspace of the adjacency matrix. Specifically, we explore the scenario where the Krylov subspace matrix is computed via power iteration using a non-random and potentially non-uniform initial vector that captures a specific biological state or perturbation. In this case, the rows of the Krylov subspace matrix (i.e., Krylov trajectories) carry important functional information about the network nodes in the biological context represented by the initial vector. We demonstrate the utility of this approach for community detection and perturbation analysis using the C. elegans neural network.
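
To make the idea of Krylov trajectories concrete, here is a minimal pure-Python sketch. It is an illustration under assumptions rather than the author's implementation: the function names are hypothetical, and normalizing each iterate to unit length is a choice made here for numerical stability, not something the abstract specifies.

```python
def matvec(A, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def normalize(v):
    """Scale v to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)


def krylov_trajectories(adjacency, start, steps):
    """Return one row per node; entry j is that node's value after j power-iteration steps."""
    v = normalize(start)
    columns = [v]
    for _ in range(steps - 1):
        v = normalize(matvec(adjacency, v))
        columns.append(v)
    # Transpose so that row i is the trajectory of node i.
    return [list(row) for row in zip(*columns)]


# Toy path graph 0 - 1 - 2 with an initial "perturbation" placed on node 0.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
traj = krylov_trajectories(A, [1, 0, 0], steps=4)
```

Each row of `traj` describes how the initial signal reaches one node over successive steps, so nodes with similar rows respond similarly to the chosen starting state.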

6
STEQ: A statistically consistent quartet distance based species tree estimation method

Saha, P.; Saha, A.; Roddur, M. S.; Sikdar, S.; Anik, N. H.; Reaz, R.; Bayzid, M. S.

2026-03-02 bioinformatics 10.64898/2026.02.27.708511 medRxiv
Top 0.1%
3.7%

Accurate estimation of large-scale species trees from multilocus data in the presence of gene tree discordance remains a major challenge in phylogenomics. Although maximum likelihood, Bayesian, and statistically consistent summary methods can infer species trees with high accuracy, most of these methods are slow and do not scale to large numbers of taxa and genes. One promising way to enable large-scale phylogeny estimation is distance-based estimation. Here, we present STEQ, a new statistically consistent, fast, and accurate distance-based method to estimate species trees from a collection of gene trees. We used a quartet-based distance metric which is statistically consistent under the multi-species coalescent (MSC) model. The running time of STEQ scales as O(kn^2 log n), for n taxa and k genes, which is asymptotically faster than the leading summary-based methods such as ASTRAL. We evaluated the performance of STEQ in comparison with ASTRAL and wQFM-TREE, two of the most popular and accurate coalescent-based methods. Experimental findings on a collection of simulated and empirical datasets suggest that STEQ enables significantly faster inference of species trees while maintaining competitive accuracy with the best current methods. STEQ is publicly available at https://github.com/prottoysaha99/STEQ.

7
Tracking cancer dynamics from normal tissue to malignancy using perfect N- and T-gene expression markers

Perez, G. J. G.; Perez-Rodriguez, R.; Gonzalez, A.

2026-03-08 cancer biology 10.1101/2024.11.04.621130 medRxiv
Top 0.1%
3.6%

Common knowledge states that the spontaneous somatic evolution of a normal tissue may lead to a tumor. Once the tumor is formed, it naturally evolves towards a state of higher malignancy. On the other hand, perfect gene expression markers for normal tissue and tumor--the so-called N-genes and T-genes--were recently introduced. We join these two pieces of knowledge in order to argue that: 1) Only N-markers participate in the spontaneous dynamics of a normal tissue. The number of active markers decreases as the tissue approaches the transition point where it becomes a tumor. 2) Only T-markers participate in the spontaneous dynamics of tumors. The number of markers increases as the tumor becomes more malignant. 3) Both sets of genes are connected by the so-called NT-genes, i.e., genes that are simultaneously N- and T-markers. They should play a crucial role at the transition point and, possibly, when the tumor is exposed to a drug or therapy. 4) The pathways or mechanisms protecting the normal tissue from becoming a tumor may be described by a small perfect panel of N-genes. 5) The pathways or mechanisms guiding the evolution of tumors in a tissue may be described by a small perfect panel of T-genes. We illustrate the above statements with the analysis of expression data for prostate adenocarcinoma, one of the most heterogeneous tumors. In this case, there are about 1000 N-genes and 6000 T-genes, and the perfect N- and T-panels contain 11 and 8 genes, respectively. Additionally, we provide examples from lung adenocarcinoma and liver hepatocarcinoma.

8
Model selection in ADMIXTURE can be inconsistent: proof of the K=2 phenomenon

Do, D.; Terhorst, J.

2026-03-02 evolutionary biology 10.64898/2026.02.27.708651 medRxiv
Top 0.1%
3.5%

STRUCTURE and ADMIXTURE are two popular methods for detecting population structure in genetic data. They model observed genotypes as mixtures of latent ancestral populations, and the inferred admixture proportions can be used to visualize and summarize population structure. A key parameter in these models is the number of ancestral populations, K. Selecting K is a challenging problem. Perhaps the most widely used method is Evanno's ΔK, which selects K based on the second-order change in log-likelihood as K increases. However, practitioners have noted that ΔK often favors overly small K, frequently returning K = 2 even when more meaningful substructure is present. In this paper, we provide a theoretical explanation for this phenomenon: we prove that, under certain conditions, the ΔK method can be inconsistent, meaning that it can fail to identify the true number of populations even with infinite data.
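
For readers unfamiliar with the ΔK statistic, the sketch below shows one common formulation: the absolute second-order difference of the mean log-likelihood across replicate runs, divided by the standard deviation of the log-likelihood at K. The function name, the input format, and the toy numbers are all assumptions made for illustration; the paper's formal setting may differ.

```python
from statistics import mean, stdev


def delta_k(loglik_runs):
    """Compute a Delta-K value for every K that has both neighbours K-1 and K+1.

    loglik_runs: dict mapping K to a list of log-likelihoods from replicate runs
    (hypothetical input format; at least two runs per K are needed for stdev).
    """
    result = {}
    for k in sorted(loglik_runs):
        if k - 1 not in loglik_runs or k + 1 not in loglik_runs:
            continue  # the second-order difference is undefined at the ends
        second_diff = abs(
            mean(loglik_runs[k + 1]) - 2 * mean(loglik_runs[k]) + mean(loglik_runs[k - 1])
        )
        spread = stdev(loglik_runs[k])
        if spread > 0:
            result[k] = second_diff / spread
    return result


# Made-up log-likelihoods: a large jump from K=1 to K=2, small gains afterwards.
runs = {
    1: [-1000.0, -1002.0],
    2: [-900.0, -902.0],
    3: [-880.0, -884.0],
    4: [-875.0, -879.0],
}
scores = delta_k(runs)
best_k = max(scores, key=scores.get)
```

Because the statistic needs both neighbours, it is undefined at the smallest K tested, so K = 2 is the smallest value it can return when the scan starts at K = 1. On the toy data above the large first jump in likelihood dominates and `best_k` is 2.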

9
Estimating Bayesian phylogenetic information content using geodesic distances

Milkey, A.; Lewis, P. O.

2026-04-01 evolutionary biology 10.64898/2026.03.31.715656 medRxiv
Top 0.1%
3.2%

A new Bayesian measure of phylogenetic information content is introduced based on geodesic distances in treespace. The measure is based on the relative variance of phylogenetic trees sampled from the posterior distribution compared to the prior distribution. This ratio is expected to equal 1 if there is no information in the data about phylogeny and 0 if there is complete information. Trees can be scaled to have the same mean tree length to avoid dominance by edge length information and focus on topological information. The method scales well, requiring only that a valid sample can be obtained from both prior and posterior distributions. We show how dissonance (information conflict) among data sets can also be estimated. Both simulated and empirical examples are provided to illustrate that the new approach produces sensible and intuitive results.

10
Using Variable Window Sizes for Phylogenomic Analyses of Whole Genome Alignments

Ivan, J.; Lanfear, R.

2026-03-06 bioinformatics 10.64898/2026.03.04.709403 medRxiv
Top 0.1%
2.8%

Many phylogenomic studies used non-overlapping windows to address gene tree discordance across a set of aligned genomes. Recently, Ivan et al. (2025) proposed an information theoretic approach to choose an optimal window size given the alignment. However, this approach selects only a single fixed window size per chromosome, which is a useful first step but fails to account for variation in the size of non-recombining regions along each chromosome. Such variation is expected to occur due to the stochastic nature of recombination as well as the variation in recombination rates along chromosomes. In this study, we extend the approach of Ivan et al. (2025) to allow window sizes to vary across the chromosome, using a splitting-and-merging strategy that allows for each window to be of an arbitrary length. We showed that the new method outperformed the fixed-window approach in recovering gene tree topologies on a wide range of simulated datasets. Applying the new method on the genomes of seven Heliconius butterflies, we found that the average window sizes for the group ranged between 538-808bp, but with a very similar distribution of gene tree topologies compared to previous studies that used fixed window sizes. For the genomes of great apes, the average window sizes ranged from 4.2kb to 6.2kb, with the proportion of the major topology (i.e., grouping human and chimpanzee together) reaching approximately 80%. In conclusion, our study highlights the limitations of using a fixed window size when recombination rates vary across the chromosomes, and proposes a splitting-and-merging approach that allows for variable window sizes across whole genome alignments.

11
Deconvolving Phylogenetic Distance Mixtures

Arasti, S.; Sapci, A. O. B.; Rachtman, E.; El-Kebir, M.; Mirarab, S.

2026-01-21 evolutionary biology 10.64898/2026.01.18.700179 medRxiv
Top 0.1%
2.7%

Mixtures of multiple constituent organisms are sequenced in several widely used applications, including metagenomics and metabarcoding. Characterizing the elements of the sequence mixture and their abundance with respect to a reference set of known organisms has been the subject of intense research across several domains, including microbiome analyses, and methods must overcome two key challenges. First, the mixture constituents are related to each other through an evolutionary history, and hence, should not be considered independent entities. Second, sequence data is noisy, with each short read providing a limited signal. While existing approaches attempt to address these challenges, addressing both challenges simultaneously has proved challenging. For evolutionary dependencies, methods either define hierarchical clusters (e.g., taxonomies or operational taxonomic/genomic units) or use phylogenetic trees. For the second challenge, they either assemble reads into contigs, use statistical priors to summarize read placements, or attempt to analyze all reads jointly using k-mers. Despite this rich literature, a natural approach to simultaneously address both challenges has been underexplored: compute a distance from the mixture to all references, deconvolve those distances, and place the sample on multiple branches of a reference phylogeny with associated abundances. This multi-placement approach is a natural extension of the single-read phylogenetic placement used in practice. We argue that by placing the entire sample on multiple branches instead of placing reads individually, we can obtain a less noisy profile of the mixture. We formalize this approach as the phylogenetic distance deconvolution (PDD) problem, show some limits on the identifiability of PDDs, propose a slow exact algorithm, and an efficient heuristic greedy algorithm with local refinements. 
Benchmarking shows that these heuristics are effective and that our implementation of the PDD approach (called DecoDiPhy) can accurately deconvolve phylogenetic mixture distances while scaling quadratically. Applied to metagenomics, DecoDiPhy consolidates reads mapped to a large number of branches on a reference tree to a much smaller number of placements. The consolidated placements improve the accuracy of downstream tasks, such as sample differentiation and detection of differentially abundant taxa.

12
On the Comparison of LGT networks and Tree-based Networks

Marchand, B.; Tahiri, N.; Tremblay-Savard, O.; Lafond, M.

2026-04-01 bioinformatics 10.1101/2025.11.20.689557 medRxiv
Top 0.1%
2.7%

Phylogenetic networks are widespread representations of evolutionary histories for taxa that undergo hybridization or Lateral-Gene Transfer (LGT) events. There are now many tools to reconstruct such networks, but no clearly established metric to compare them. Such metrics are needed, for example, to evaluate predictions against a simulated ground truth. Despite years of effort in developing metrics, known dissimilarity measures either do not distinguish all pairs of different networks, or are extremely difficult to compute. Since it appears challenging, if not impossible, to create the ideal metric for all classes of networks, it may be relevant to design them for specialized applications. In this article, we introduce a metric on LGT networks, which consist of trees with additional arcs that represent lateral gene transfer events. Our metric is based on edit operations, namely the addition/removal of transfer arcs, and the contraction/expansion of arcs of the base tree, allowing it to connect the space of all LGT networks. We show that it is linear-time computable if the order of transfers along a branch is unconstrained but NP-hard otherwise, in which case we provide a fixed-parameter tractable (FPT) algorithm in the level. We implemented our algorithms and demonstrate their applicability on three numerical experiments. Full online version: https://www.biorxiv.org/content/10.1101/2025.11.20.689557

13
Interpolating and Extrapolating Node Counts in Colored Compacted de Bruijn Graphs for Pangenome Diversity

Parmigiani, L.; Peterlongo, P.

2026-03-18 bioinformatics 10.64898/2026.03.16.711983 medRxiv
Top 0.1%
2.1%

A pangenome is a collection of taxonomically related genomes, often from the same species, serving as a representation of their genomic diversity. The study of pangenomes, or pangenomics, aims to quantify and compare this diversity, which has significant relevance in fields such as medicine and biology. Originally conceptualized as sets of genes, pangenomes are now commonly represented as pangenome graphs. These graphs consist of nodes representing genomic sequences and edges connecting consecutive sequences within a genome. Among possible pangenome graphs, a common option is the compacted de Bruijn graph. In our work, we focus on the colored compacted de Bruijn graph, where each node is associated with a set of colors that indicate the genomes traversing it. In response to the evolution of pangenome representation, we introduce a novel method for comparing pangenomes by their node counts, addressing two main challenges: the variability in node counts arising from graphs constructed with different numbers of genomes, and the large influence of rare genomic sequences. We propose an approach for interpolating and extrapolating node counts in colored compacted de Bruijn graphs, adjusting for the number of genomes. To tackle the influence of rare genomic sequences, we apply Hill numbers, a well-established diversity index previously utilized in ecology and metagenomics for similar purposes, to proportionally weight both rare and common nodes according to the frequency of genomes traversing them.
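
Hill numbers, which the abstract above applies to graph nodes, can be illustrated with a short sketch. The formula is the standard one from ecology; treating each node's weight as the number of genomes traversing it is an assumption made here for illustration, and the function name and toy counts are hypothetical rather than taken from the authors' method.

```python
import math


def hill_number(weights, q):
    """Hill number of order q for a list of non-negative weights.

    q = 0 counts all items equally, q = 1 is the exponential of Shannon
    entropy, and larger q gives progressively more weight to common items.
    """
    total = sum(weights)
    p = [w / total for w in weights if w > 0]
    if q == 1:
        # Limit of the general formula as q approaches 1.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


# Toy example: three nodes, one traversed by 8 genomes and two traversed by 1 each.
node_counts = [8, 1, 1]
richness = hill_number(node_counts, 0)      # plain node count
common_only = hill_number(node_counts, 2)   # dominated by the frequent node
```

With these counts the order-0 value is simply the number of nodes, while the order-2 value drops to about 1.5 because the two rare nodes contribute little. Varying q is what lets a comparison tune the influence of rare sequences.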

14
Benchmark of biomarker identification and prognostic modeling methods on diverse censored data

Fletcher, W. L.; Sinha, S.

2026-04-01 bioinformatics 10.64898/2026.03.29.715113 medRxiv
Top 0.1%
1.9%

The practices of identifying biomarkers and developing prognostic models using genomic data have become increasingly prevalent. Such data often feature characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performance in these tasks on diverse right-censored time-to-event data (also known as survival time data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks, primarily to compare their variable selection capability and secondarily to compare survival time prediction, on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses on a publicly available and widely used cancer cohort from The Cancer Genome Atlas using these methods. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well in all metrics, and the LASSO and elastic net excelled when evaluating concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performance in controlling the false discovery rate. The performance of some methods was greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers in choosing the best approach for their needs when working with genomic data.

15
POTTR: Identifying Recurrent Trajectories in Evolutionary and Developmental Processes using Posets

Käufler, S. C.; Schmidt, H.; Jürgens, M.; Klau, G. W.; Sashittal, P.; Raphael, B.

2026-02-26 bioinformatics 10.64898/2026.02.25.707960 medRxiv
Top 0.2%
1.7%

Multiple biological processes, including cancer evolution and organismal development, are described as a sequence of events with a temporal ordering. While cancer evolves independently in each patient, DNA sequencing studies have shown that in some cancers different patients share specific orders of mutations and these correlate with distinct morphology, drug response, and treatment outcomes. Several methods have been developed to identify such recurrent trajectories of genetic events from phylogenetic trees, but this is complicated by high intra- and inter-tumor heterogeneity as well as uncertainty in the inferred tumor phylogenies including the ambiguous orders between some mutations. We formalize the problem of finding recurrent mutation trajectories using a novel framework of incomplete partially ordered sets (posets), which generalize representations used in previous works and explicitly account for the uncertainty in tumor phylogenies. We define the problem of identifying the largest recurrent trajectories shared in at least k input phylogenies as the maximum k-common induced incomplete subposet (MkCIIS) problem, which we show is NP-hard. We present a combinatorial algorithm, POsets for Temporal Trajectory Resolution (POTTR), to solve the MkCIIS problem using a conflict graph that models recurrent trajectories as independent sets. Thereby we identify maximum recurrent trajectories while resolving multiple sources of uncertainty, like mutation clusters, in the phylogenetic data. We apply POTTR to TRACERx non-small cell lung cancer bulk sequencing and acute myeloid leukemia single-cell sequencing data and through resolution of mutation clusters discover previously unreported trajectories of high statistical significance. On lineage tracing data of an in vitro embryoid model, POTTR identifies conserved differentiation routes across biological replicates and how these routes change in response to chemical perturbations.

16
A covarion model for phylogenetic estimation using discrete morphological datasets

Khakurel, B.; Hoehna, S.

2026-02-20 evolutionary biology 10.1101/2025.06.20.660793 medRxiv
Top 0.2%
1.7%

The rate of evolution of a single morphological character is not homogeneous across the phylogeny and this rate heterogeneity varies between morphological characters. However, traditional models of morphological character evolution often assume that all characters evolve according to a time-homogeneous Markov process, which applies uniformly across the entire phylogeny. While models incorporating among-character rate variation alleviate the assumption of the same rate for all characters, they still fail to address lineage-specific rate variation for individual characters. The covarion model, originally developed for molecular data to model the invariability of some sites for parts of the phylogeny, provides a promising framework for addressing this issue in morphological phylogenetics. In this study, we extend the covarion model in RevBayes to morphological character evolution, which we call the covariomorph model, and apply it to a diverse range of morphological datasets. Our covariomorph model utilizes multiple rate categories derived from a discretized probability distribution, which scales rate matrices accordingly. Characters are allowed to evolve within any of these rate categories, with the possibility of switching between rate categories during the evolutionary process. We verified our implementation of the covariomorph model with the help of simulations. Additionally, we examined 164 empirical datasets, finding patterns of rate heterogeneity compatible with covarion-like dynamics in approximately half of them. Upon further examination of two focal datasets that exhibited covarion-like rate variation, we found that the covariomorph model provides a more nuanced approach to incorporate rate variation across lineages, significantly affecting the resulting tree topology and branch lengths compared to traditional models. 
The observed sensitivity of branch lengths to model choice underscores potential implications of this approach for divergence time estimation and evolutionary rate calculations. By accounting for lineage- and character-specific rate shifts, the covariomorph model offers a robust framework to improve the accuracy of morphological phylogenetic inference.

17
A New Information Theoretic Approach Shows that Mixture Models Outperform Partitioned Models for Phylogenetic Analyses of Amino Acid Data

Ren, H.; Jiang, C.; Wong, T. K. F.; Shao, Y.; Susko, E.; Minh, B. Q.; Lanfear, R.

2026-03-18 evolutionary biology 10.64898/2026.03.16.712229 medRxiv
Top 0.2%
1.7%

Partitioned and mixture models are widely employed in Maximum Likelihood phylogenetic analyses of large genomic datasets. Comparing the fit of the two types of models has been challenging, because standard information-theoretic approaches cannot be applied. Mixture models are increasingly popular for the analysis of amino acid datasets and can lead to different conclusions compared to partitioned models. This raises an important question - which type of model tends to perform better? Susko et al. (2026) recently introduced the marginal Akaike information criterion (mAIC), which allows mixture models and partitioned models to be directly compared for the first time. Here, we use the mAIC and a range of other approaches to compare the fit of mixture and partitioned models across a diverse set of empirical datasets. We show that mixture models are universally favoured on amino acid datasets. This has important implications for interpreting empirical analyses and suggests that continued development of mixture models is an important avenue for future research.

18
Ancestral state reconstruction with discrete characters using deep learning

Nagel, A. A.; Landis, M. J.

2026-03-21 evolutionary biology 10.64898/2026.03.19.712918 medRxiv
Top 0.2%
1.7%

Ancestral state reconstruction is a classical problem of broad relevance in phylogenetics. Likelihood-based methods for reconstructing ancestral states under discrete character models, such as Markov models, have proven extremely useful, but only work so long as the assumed model yields a tractable likelihood function. Unfortunately, extending a simple but tractable phylogenetic model to possess new, but biologically realistic, properties often results in an intractable likelihood, preventing its use in standard modeling tasks, including ancestral state reconstruction. The rapid advancement of deep learning offers a potential alternative to likelihood-based inference of ancestral states, particularly for models with intractable likelihoods. In this study, we modify the phylogenetic deep learning software PHYDDLE to conduct ancestral state reconstruction. We evaluate PHYDDLE's performance under various methodological and modeling conditions, while comparing to Bayesian inference when possible. For simple models and small trees, its performance resembles the performance of Bayesian inference, but worsens as tree size increases. While PHYDDLE still performs adequately for more complex models, such as speciation and extinction models, the estimates differ more from Bayesian inference in comparison with simpler models. Lastly, we use PHYDDLE to infer ancestral states for two empirical datasets, one of the ancestral ranges of a subclade of the genus Liolaemus and ancestral locations for sequences from the 2014 Sierra Leone Ebola virus disease outbreak.

19
An abstract model of nonrandom, non-Lamarckian mutation in evolution using a multivariate estimation-of-distribution algorithm

Vasylenko, L.; Livnat, A.

2026-04-01 evolutionary biology 10.64898/2026.03.30.715341 medRxiv
Top 0.2%
1.7%

At the fundamental conceptual level, two alternatives have traditionally been considered for how mutations arise and how evolution happens: 1) random mutation and natural selection, and 2) Lamarckism. Recently, the theory of Interaction-based Evolution (IBE) has been proposed, according to which mutations are neither random nor Lamarckian, but are influenced by information accumulating internally in the genome over generations. Based on the estimation-of-distribution algorithms framework, we present a simulation model that demonstrates nonrandom, non-Lamarckian mutation concretely while capturing indirectly several aspects of IBE: selection, recombination, and nonrandom, non-Lamarckian mutation interact in a complementary fashion; evolution is driven by the interaction of parsimony and fit; and random bits do not directly encode improvement but enable generalization by the manner in which they connect with the rest of the evolutionary process. Connections are drawn to Darwin's observations that changed conditions increase the rate of production of heritable variation; to the causes of bell-shaped distributions of traits and how these distributions respond to selection; and to computational learning theory, where analogizing evolution to learning in accord with IBE casts individuals as examples and places the learned hypothesis at the population level. The model highlights the importance of incorporating internal integration of information through heritable change in both evolutionary theory and evolutionary computation.

20
On the correctness of gene tree tagging and the consistency of ASTRAL-pro under a unified model of gene duplication, loss, and coalescence

Parsons, R.; Liu, Y.; Dua, P.; Markin, A.; Molloy, E.

2026-01-21 bioinformatics 10.64898/2026.01.20.700722 medRxiv
Top 0.2%
1.5%

Motivation: ASTRAL-pro is the leading method for reconstructing species trees under complex evolutionary scenarios involving gene duplication, loss, and coalescence. A major open question is whether ASTRAL-pro is statistically consistent under a unified model of these processes, called DLCoal. This question is challenging to address because ASTRAL-pro seeks a species tree that maximizes the number of four-taxon trees (called quartets) also displayed by the input (multi-copy) gene trees, excluding those induced by duplications and agglomerating those that are homeomorphic up to duplications. Critically, there is no notion of correctness when tagging gene tree vertices as duplication or speciation events in the context of deep coalescence. Results: Here, we propose that a gene tree vertex is correctly tagged as a duplication if it is the most recent common ancestor of at least one pair of gene copies related via a duplication event. Under our definition, deep coalescence propagates duplication tags across gene tree vertices, sometimes resulting in the exclusion of quartets on orthologous gene copies. Nevertheless, we show that A-pro is statistically consistent under the DLCoal model for an exclusion-only version of its objective function, assuming the input gene trees are correctly rooted and tagged. To empirically evaluate this modification, we exclude "duplication quartets" in the related method TREE-QMC and find that it achieves similar accuracy to A-pro on simulated data under varying rates of deep coalescence, duplication and loss, and gene tree estimation error, as well as on a plant data set. Availability and Implementation: TREE-QMC-pro is available on Github: https://github.com/molloy-lab/TREE-QMC/tree/tqmc-pro.